Abstract: Sampling-based methods have previously been proposed
for the problem of finding interesting associations in data,
even for low-support items. While these methods do not guarantee 
precise results, they can be vastly more efficient than approaches 
that rely on exact counting. However, for many similarity mea- 
sures no such methods have been known. In this paper we show 
how a wide variety of measures can be supported by a simple 
biased sampling method. The method also extends to find high- 
confidence association rules. We demonstrate theoretically that 
our method is superior to exact methods when the threshold for 
"interesting similarity/confidence" is above the average pairwise 
similarity/confidence, and the average support is not too low. 
Our method is particularly good when transactions contain many 
items. We confirm in experiments on standard association mining 
benchmarks that this gives a significant speedup on real data 
sets (sometimes much larger than the theoretical guarantees). 
Reductions in computation time of over an order of magnitude, 
and significant savings in space, are observed. 

Index Terms — algorithms; sampling; data mining; association 
rules. 

I. Introduction 

A central task in data mining is finding associations in a 
binary relation. Typically, this is phrased in a "market basket" 
setup, where there is a sequence of baskets (from now on 
"transactions"), each of which is a set of items. The goal 
is to find patterns such as "customers who buy diapers are 
more likely to also buy beer". There is no canonical way of 
defining whether an association is interesting — indeed, this 
seems to depend on problem-specific factors not captured by 
the abstract formulation. As a result, a number of measures 
exist: In this paper we deal with some of the most common 
measures, including Jaccard [1], lift [3], cosine, and
all_confidence [4], [5]. In addition, we are interested in
high-confidence association rules, which are closely related to
the overlap coefficient similarity measure. We refer to [6, 
Chapter 5] for general background and discussion of similarity 
measures. 

In the discussion we limit ourselves to the problem of binary 
associations, i.e., patterns involving pairs of items. There is a 
large literature considering the challenges of finding patterns 
involving larger item sets, taking into account the aspect of
time, multiple-level rules, etc. While some of our results can
be extended to cover larger item sets, we will for simplicity
concentrate on the binary case. Previous methods rely on one
of the following approaches:

1) Identifying item pairs that "occur frequently together"
in the transactions (in particular, this means counting the
number of co-occurrences of each such pair), or

2) Computing a "signature" for each item such that the
similarity of every pair of items can be estimated by
(partially) comparing the item signatures.

[...] which contains stronger theoretical results and fixes a mistake in the reporting
of experiments.

Our approach is different from both these approaches, and 
generally offers improved performance and/or flexibility. In 
some sense we go directly to the desired result, which is the 
set of pairs of items with similarity measure above some user-defined
threshold $\Delta$. Our method is sampling based, which
means that the output may contain false positives, and there 
may be false negatives. However, these errors are rigorously 
understood, and can be reduced to any desired level, at some 
cost of efficiency — our experimental results are for a false 
negative probability of less than 2%. The method for doing 
sampling is the main novelty of this paper, and is radically 
different from previous approaches that involve sampling. 

The main focus in many previous association mining papers 
has been on space usage and the number of passes over the 
data set, since these have been recognized as main bottlenecks. 
We believe that time has come to also carefully consider CPU 
time. A transaction with $b$ items contains $\binom{b}{2}$ item pairs, and
if $b$ is not small the effort of considering all pairs is
non-negligible compared to the cost of reading the item set. This
is true in particular if data resides in RAM, or on a modern 
SSD that is able to deliver data at a rate of more than a 
gigabyte per second. One remedy that has been used (to reduce 
space, but also time) is to require high support, i.e., define 
"occur frequently together" such that most items can be thrown 
away initially, simply because they do not occur frequently 
enough (they are below the support threshold). However, as 
observed in [1], this means that potentially interesting or useful
associations (e.g. correlations between genes and rare diseases) 
are not reported. In this paper we consider the problem 
of finding associations without support pruning. Of course, 
support pruning can still be used to reduce the size of the data 
set before our algorithms are applied. 

In the following sections we first discuss the need for
focusing on CPU time in data mining, and then elaborate on
the relationship between our contribution and related works.

A. I/O versus CPU 

In recent years, the capacity of very fast storage devices 
has exploded. A typical desktop computer has 4 to 16 GB of
RAM that can be read (sequentially) at a speed of at least
800 million 32-bit words per second. The flash-based ioDrive 
Duo of Fusion-io offers up to over a terabyte of storage that 
can be read at around 400 million 32-bit words per second. 
Thus, even massive data sets can be read at speeds that make it 
challenging for CPUs to keep up. An 8-core system must, for
example, process 100 million (or 50 million) items per core
per second. At 3 GHz this is 30 clock cycles (or 60 clock
cycles) per item. This means that any kind of processing that
is not constant time per item (e.g., using time proportional to 
the size of the transaction containing the item) is likely to be 
CPU bound rather than I/O bound. For example, a hash table 
lookup requires on the order of 5-10 ns even if the hash table is 
L2 cache-resident (today less than 10 MB per core). This gives 
an upper limit of 100-200 million lookups per second in each 
core, meaning that any algorithm that does more than a dozen 
hash table operations per item (e.g. updating the count of some 
item pairs) is definitely CPU bound, rather than I/O bound. 
In conclusion, we believe it is time to carefully consider 
optimizing internal computation time, rather than considering 
all computation as "free" by only counting I/Os or number of 
passes. Once CPU efficient algorithms are known, it is likely 
that the remaining bottleneck is I/O. Thus, we also consider 
I/O efficient versions of our algorithm. 
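
As a back-of-the-envelope check of the argument above, the following minimal sketch recomputes the per-item cycle budget and the hash-lookup ceiling from the figures quoted in this section. The function names are illustrative assumptions, and the model (cores share the input stream evenly, one item per 32-bit word) is a simplification.

```python
def cycles_per_item(words_per_second, cores, clock_hz):
    """Clock cycles available per item if the cores split the stream evenly."""
    items_per_core_per_second = words_per_second / cores
    return clock_hz / items_per_core_per_second


def max_lookups_per_second(lookup_ns):
    """Upper limit on sequential hash table lookups per core."""
    return 1e9 / lookup_ns


# Figures quoted in the text: RAM at 800 million words/s, flash at
# 400 million words/s, 8 cores, 3 GHz clock, 5-10 ns per cached lookup.
ram_budget = cycles_per_item(800e6, cores=8, clock_hz=3e9)
flash_budget = cycles_per_item(400e6, cores=8, clock_hz=3e9)
fast_lookups = max_lookups_per_second(5)
slow_lookups = max_lookups_per_second(10)

print(f"RAM:   {ram_budget:.0f} cycles per item")
print(f"flash: {flash_budget:.0f} cycles per item")
print(f"lookups per second per core: {slow_lookups:.0e} to {fast_lookups:.0e}")
```

Comparing the lookup ceiling with the number of items each core must consume per second shows how few hash table operations per item fit into the budget.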

B. Previous work 

Exact counting of frequent item sets: The approach pioneered
by the A-Priori algorithm [7], [8], and refined by many
others (see e.g. [9]-[13]), allows, as a special case, finding all
item pairs that occur in more than $k$ transactions, for
a specified threshold $k$. However, for the similarity measures
we consider, the value of $k$ must in general be chosen as
a low constant, since even pairs of very infrequent items can
have high similarity. This means that such methods degenerate
to simply counting the number of occurrences of all pairs,
spending time $\Omega(b^2)$ on a transaction with $b$ items. Also,
generally the space usage of such methods (at least those
requiring a constant number of passes over the data) is at least
1 bit of space for each pair that occurs in some transaction.

The problem of counting the number of co-occurrences of
all item pairs is in fact equivalent to the problem of multiplying
sparse 0-1 matrices. To see this, consider the $n \times m$ matrix $A$ in
which each row $A_i$ is the incidence vector having 1 in position
$p$ iff the $i$th element in the set of items appears in the $p$th
transaction. Each entry $(i,j)$ of the $n \times n$ matrix $A A^{T}$
represents the number of transactions in which the pair $(i,j)$
appears. The best theoretical algorithms for (sparse) matrix
multiplication [14]-[16] scale better than the A-Priori family
of methods as the transaction size gets larger, but because of
huge constant factors this is so far only of theoretical interest.
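
The equivalence between pair counting and matrix multiplication can be checked on a toy input. The sketch below is illustrative only: items are assumed to be numbered from 0, the matrices are dense Python lists rather than sparse structures, and the helper names are not from the paper.

```python
from collections import Counter
from itertools import combinations


def count_pairs(transactions):
    """Exact counting: a transaction with b items causes b*(b-1)/2 updates."""
    counts = Counter()
    for t in transactions:
        for i, j in combinations(sorted(set(t)), 2):
            counts[(i, j)] += 1
    return counts


def incidence_matrix(transactions, n):
    """n x m 0-1 matrix: row i has a 1 in column p iff item i is in transaction p."""
    return [[1 if i in t else 0 for t in transactions] for i in range(n)]


def times_transpose(a):
    """Dense product A * A^T; entry (i, j) counts transactions holding both i and j."""
    return [[sum(x * y for x, y in zip(row_i, row_j)) for row_j in a] for row_i in a]


transactions = [{0, 1, 2}, {1, 2}, {0, 2, 3}]
pair_counts = count_pairs(transactions)
product = times_transpose(incidence_matrix(transactions, n=4))

# Off-diagonal entries of A * A^T agree with the explicit pair counts.
for i in range(4):
    for j in range(i + 1, 4):
        assert product[i][j] == pair_counts.get((i, j), 0)
```

The diagonal of the product holds the item supports, which is the information ItemCount computes later in the paper.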



Sampling transactions: Toivonen [17] investigated the use
of sampling to find candidate frequent pairs: Take a
small, random subset of the transactions and see what pairs
are frequent in the subset. This can considerably reduce the
memory used to actually count the number of occurrences
(in the full set), at the cost of some probability of missing a
frequent pair. This approach is good for high-support items,
but low-support associations are likely to be missed, since few
transactions contain the relevant items.

Locality-sensitive hashing: Cohen et al. [1] proposed the
use of another sampling technique, called min-wise independent
hashing, where a small number of occurrences of each
item (a "signature") is sampled. This means that occurrences
of items with low support are more likely to be sampled. As a
result, pairs of (possibly low-support) items with high jaccard
coefficient are found, with a probability of false positives
and negatives. A main result of [1] is that the time complexity
of their algorithm is proportional to the sum of all pairwise 
jaccard coefficients, plus the cost of initially reading the data. 
Our main result has basically the same form, but has the 
advantage of supporting a wide class of similarity measures. 

Min-wise independent hashing belongs to the class of
locality-sensitive hashing methods [18]. Another such method
was described by Charikar [19], who showed how to compute
succinct signatures whose Hamming distance reflects angles
between incidence vectors. This leads to an algorithm for finding
item pairs with cosine similarity above a given threshold
(again, with a probability of false positives and negatives),
that uses linear time to compute the signatures, and $\Omega(n^2)$
time to find the similar pairs, where $n$ is the number of
distinct items in all transactions. Charikar also shows that 
many similarity measures, including some measures supported 
by our algorithm, cannot be handled using the approach of 
locality-sensitive hashing. 

Deterministic signature methods: In the database community,
finding all pairs with similarity above a given threshold is
sometimes referred to as a "similarity join." Recent results on
similarity joins include [20]-[23]. While not always described
in this way, these methods can be seen as deterministic
analogues of the locality-sensitive hashing methods, offering
exact results. The idea is to avoid computing the similarity
of every pair by employing succinct "signatures" that may
serve as witnesses for low similarity. Most of these methods
require the signatures of every pair of items to be (partially)
compared, which takes $\Omega(n^2)$ time. However, the worst-case
asymptotic performance appears to be no better than the A-Priori
family of methods. A similarity join algorithm that
runs faster than $\Omega(n^2)$ in some cases is described in [20].
However, this algorithm exhibits a polynomial dependence on
the maximum number $k$ of differences between two incidence
vectors that are considered similar, and for many similarity
measures the relevant value of $k$ may be linear in the number
$m$ of transactions.



C. Our results 

In this paper we present a novel sampling technique to
handle a variety of measures (including jaccard, lift, cosine,
and all_confidence), even finding similar pairs among low
support items. The idea is to sample a subset of all pairs $(i,j)$
occurring in the transactions, letting the sampling probability
be a function of the supports of $i$ and $j$. For a parameter $\mu$,
the probability is chosen such that each pair with similarity
above a threshold $\Delta$ (an "interesting pair") will be sampled
at least $\mu$ times, in expectation, while we do not expect to
see a pair $(i,j)$ whose measure is significantly below $\Delta$. A
naive implementation of this idea would still use quadratic
time for each transaction, but we show how to do the sampling
in near-linear time (in the size of the transaction and number
of sampled pairs).

The number of times a pair is sampled follows a binomial 
distribution, which allows us to use the sample to infer 
which pairs are likely to have similarity above the threshold, 
with rigorous bounds on false negative and false positive 
probabilities. We show that the time used by our algorithm
is (nearly) linear in the input size and in the sum of all
pairwise similarities between items, divided by the threshold
$\Delta$. This is (close to) the best complexity one could hope for
with no conditions on the distribution of pairwise similarities.
Under reasonable assumptions, e.g. that the average support
is not too low, this gives a speedup of a factor $\Omega(b/\log b)$,
where $b$ is the average size of a transaction.

We show in extensive experiments on standard data sets 
for testing data mining algorithms that our approach (with a 
1.8% false negative probability) gives speedup factors in the 
vicinity of an order of magnitude, as well as significant savings 
in the amount of space required, compared to exact counting 
methods. We also present evidence that for data sets with many 
distinct items, our algorithm may perform significantly less 
work than methods based on locality-sensitive hashing. 

D. Notation 

Let $T_1, \dots, T_m$ be a sequence of transactions, $T_j \subseteq [n]$.
For $i = 1, \dots, n$ let $S_i = \{ j \mid i \in T_j \}$, i.e., $S_i$ is the set of
occurrences of item $i$.

We are interested in finding associations among items,
and consider a framework that captures the most common
measures from the data mining literature. Specifically, we can
handle a similarity measure $s(i,j)$ if there exists a function
$f : \mathbb{N} \times \mathbb{N} \times \mathbb{R}^{+} \to \mathbb{R}^{+}$ that is non-increasing in all
parameters, and such that:

$$|S_i \cap S_j| \, f(|S_i|, |S_j|, s(i,j)) = 1 .$$

In other words, the similarity should be the solution to an
equation of the form given above. Fig. 1 shows particular measures
that are special cases. The monotonicity requirements
on $f$ hold for any reasonable similarity measure: increasing
$|S_i \cap S_j|$ should not decrease the similarity, and adding an
occurrence of $i$ or $j$ should not increase the similarity unless
$|S_i \cap S_j|$ increases. In the following we assume that $f$ is
computable in constant time, which is clearly a reasonable
assumption for the measures of Fig. 1.

Fig. 1. Some measures covered by our algorithm and the
corresponding functions. Note that finding all pairs with
overlap coefficient at least $\Delta$ implies finding all association
rules with confidence at least $\Delta$.
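
To make the defining equation concrete, the sketch below writes down candidate functions f for four of the measures named in the text. The table of Fig. 1 is not available here, so these functions are an assumption: they are obtained by taking the textbook definition of each measure and solving the defining equation for f. Lift is left out because its normalization is not spelled out in this text.

```python
import math

# Textbook definitions, given inter = |S_i ∩ S_j|, a = |S_i|, b = |S_j|.
SIMILARITY = {
    "cosine": lambda inter, a, b: inter / math.sqrt(a * b),
    "jaccard": lambda inter, a, b: inter / (a + b - inter),
    "all_confidence": lambda inter, a, b: inter / max(a, b),
    "overlap_coefficient": lambda inter, a, b: inter / min(a, b),
}

# Assumed f(a, b, s), chosen so that inter * f(a, b, s(i, j)) = 1.
# Each one is non-increasing in a, b and s, as the framework requires.
F = {
    "cosine": lambda a, b, s: 1.0 / (s * math.sqrt(a * b)),
    "jaccard": lambda a, b, s: (1.0 + s) / (s * (a + b)),
    "all_confidence": lambda a, b, s: 1.0 / (s * max(a, b)),
    "overlap_coefficient": lambda a, b, s: 1.0 / (s * min(a, b)),
}


def defining_equation_value(measure, s_i, s_j):
    """Return |S_i ∩ S_j| * f(|S_i|, |S_j|, s(i, j)); should equal 1."""
    inter = len(s_i & s_j)
    a, b = len(s_i), len(s_j)
    s = SIMILARITY[measure](inter, a, b)
    return inter * F[measure](a, b, s)


s_i = {1, 2, 3, 4}
s_j = {3, 4, 5, 6, 7, 8}
for name in F:
    print(name, defining_equation_value(name, s_i, s_j))
```

Plugging the threshold in place of s(i,j) gives the per-occurrence sampling weight used by the algorithm in Section II.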

II. Our algorithm 

The goal is to identify pairs $(i,j)$ where $s(i,j)$ is "large".
Given a user-defined threshold $\Delta$ we want to report the pairs
where $s(i,j) \ge \Delta$. We observe that all measures in Fig. 1 are
symmetric, so it suffices to find all pairs $(i,j)$ where
$|S_i| \le |S_j|$, $i \ne j$, and $s(i,j) \ge \Delta$.

A. Algorithm idea 

Our algorithm is randomized and finds each qualifying
pair with probability $1 - \varepsilon$, where $\varepsilon > 0$ is a user-defined
error probability. The algorithm may also return some false
positives, but each false positive pair is likely to have similarity
within a small constant factor of $\Delta$. If desired, the false
positives can be reduced or eliminated in a second phase, but
we do not consider this step here.

The basic idea is to randomly sample pairs of items that
occur together in some transaction such that for any pair $(i,j)$
the expected number of times it is sampled is a strictly increasing
function of $s(i,j)$. Indeed, in all cases except the jaccard
measure it is simply proportional to $s(i,j)$. We scale the
sampling probability such that for all pairs with $s(i,j) \ge \Delta$
we expect to see at least $\mu$ occurrences of $(i,j)$, where $\mu$ is a
parameter (defined later) that determines the error probability.

B. Implementation 

Fig. 2 shows our algorithm, called BiSam (for biased
sampling). The algorithm iterates through the transactions, and
for each transaction $T_t$ adds a subset of $T_t \times T_t$ to a multiset
$M$ in time that is linear in $|T_t|$ and the number of pairs
added. We use $T_t[i]$ to denote the $i$th item in $T_t$. Because
$f$ is non-increasing and $T_t$ is sorted according to the order
induced by $c(\cdot)$ we will add $(T_t[i], T_t[j]) \in T_t \times T_t$ if and
only if $f(c(T_t[i]), c(T_t[j]), \Delta)\mu > r$. The second loop of the
algorithm builds an output set containing those pairs $(i,j)$ that
either occur at least $\mu/2$ times in $M$, or where the number of
occurrences in $M$ implies that $s(i,j) \ge \Delta$ (with probability 1).



procedure BiSam(T_1, ..., T_m; f, μ, Δ)
    c := ItemCount(T_1, ..., T_m);
    M := ∅;
    for t := 1 to m do
        sort T_t[] s.t. c(T_t[j]) ≤ c(T_t[j+1]) for 1 ≤ j < |T_t|;
        let r be a random number in [0; 1);
        for i := 1 to |T_t| do
            j := i + 1;
            while j ≤ |T_t| and f(c(T_t[i]), c(T_t[j]), Δ)μ > r do
                M := M ∪ {(T_t[i], T_t[j])};
                j := j + 1;
            end
        end
    end
    R := ∅;
    for (i,j) ∈ M do
        if M(i,j) ≥ μ/2 or M(i,j) f(c(i), c(j), Δ) ≥ 1 then
            R := R ∪ {(i,j)};
    return R;
end

Fig. 2. Pseudocode for the BiSam algorithm. The procedure
ItemCount(·) returns a function (hash map) that contains
the number of occurrences of each item. T_t[j] denotes the
jth item in transaction t. M is a multiset that is updated by
inserting certain randomly chosen pairs (i,j). The number of
occurrences of a pair (i,j) is denoted M(i,j).
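
The pseudocode of Fig. 2 translates fairly directly into Python. The following is a minimal in-memory sketch rather than the authors' implementation: it assumes hashable, comparable item identifiers, breaks ties between equal supports by item id (the pseudocode leaves tie-breaking open), uses the reporting rule "at least μ/2 samples", and uses an assumed cosine function `cosine_f` derived from the defining equation in Section I-D.

```python
import math
import random
from collections import Counter


def cosine_f(a, b, delta):
    """Assumed f for cosine similarity: 1 / (delta * sqrt(|S_i| * |S_j|))."""
    return 1.0 / (delta * math.sqrt(a * b))


def bisam(transactions, f, mu, delta, rng=None):
    """Biased pair sampling; returns the set of reported pairs."""
    rng = rng or random.Random()
    # ItemCount: support of every item.
    c = Counter(item for t in transactions for item in set(t))
    sampled = Counter()  # the multiset M
    for t in transactions:
        # Ascending support; ties broken by item id so that a given pair
        # is always stored in the same orientation.
        items = sorted(set(t), key=lambda x: (c[x], x))
        r = rng.random()  # one random number in [0, 1) per transaction
        for i in range(len(items)):
            j = i + 1
            # f is non-increasing, so once the test fails for some j it
            # fails for every later j as well and the scan can stop.
            while j < len(items) and f(c[items[i]], c[items[j]], delta) * mu > r:
                sampled[(items[i], items[j])] += 1
                j += 1
    result = set()
    for (i, j), count in sampled.items():
        if count >= mu / 2 or count * f(c[i], c[j], delta) >= 1:
            result.add((i, j))
    return result


example = [{1, 2}, {1, 2}, {1, 2, 3}, {3, 4}]
print(bisam(example, cosine_f, mu=100, delta=0.7, rng=random.Random(0)))
```

With μ = 100 every pair in this tiny input is counted exactly, so the output does not depend on the random numbers: the pairs with cosine similarity at least 0.7 are reported. Realistic runs use a much smaller μ, in which case the result is randomized.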



The best implementation of the subprocedure ItemCount
depends on the relationship between available memory and
the number $n$ of distinct items. If there is sufficient internal
memory, it can be efficiently implemented using a hash table.
For larger instances, a sort-and-count approach can be used
(Section III-B). The multiset $M$ can be represented using a
hash table with counters (if it fits in internal memory), or
more generally by an external memory data structure. In the
following we first consider the standard model (often referred
to as the "RAM model"), where the hash tables fit in internal
memory, and assume that each insertion takes constant time.
Then we consider the I/O model, for which an I/O efficient
implementation is discussed. As we will show in Section IV
a sufficiently large value of $\mu$ is $8\ln(1/\varepsilon)$. Fig. 5 shows more
exact, concrete values of $\mu$ and corresponding false negative
probabilities $\varepsilon$.

Example. Suppose ItemCount has been run and the
supports of items 1-6 are as shown in Fig. 3. Suppose now
that the transaction $T_t = \{6, 5, 4, 3, 2, 1\}$ is given. Note that its
items are written in order of increasing number of occurrences of
each item. Assuming the similarity measure is cosine, $\mu = 10$,
$\Delta = 0.7$, and $r$ for this transaction equal to 0.9, our algorithm
would select from $T_t \times T_t$ the pairs shown in Fig. 4.

Suppose that after processing all transactions the pair (6, 5)
occurs 3 times in $M$, (6, 4) occurs twice in $M$, (6, 1) occurs
once in $M$, and (5, 4) occurs 4 times in $M$. Then the algorithm
would output the pairs: (6, 5) (since $M(6,5) < \mu/2$ but
$M(6,5) f(3, 5, 0.7) \ge 1$), and (5, 4) (same situation as before).

Fig. 3. Items in the example, with corresponding ItemCount
values.

Fig. 4. Pairs selected from $T_t$ in the example. Notice that
after realizing the pair (5, 3) does not satisfy the inequality
$f(c(5), c(3), \Delta)\mu > r$, the algorithm will not take into account
the pairs (5, 2) and (5, 1).

III. Analysis of running time 
Our main lemma is the following:

Lemma 1: For all pairs $(i,j)$, where $i \ne j$ and $c(i) \le c(j)$,
if $f(c(i), c(j), \Delta)\mu < 1$ then at the end of the procedure,
$M(i,j)$ has binomial distribution with $|S_i \cap S_j|$ trials and mean

$$|S_i \cap S_j| \, f(|S_i|, |S_j|, \Delta)\, \mu .$$

If $f(c(i), c(j), \Delta)\mu \ge 1$ then at the end of the procedure

$$M(i,j) = |S_i \cap S_j| .$$

Proof: As observed above, the algorithm adds the pair $(i,j)$
to $M$ in iteration $t$ if and only if $(i,j) \in T_t \times T_t$ and
$f(c(i), c(j), \Delta)\mu > r$, where $r$ is the random number in [0; 1)
chosen in iteration $t$. This means that for every $t \in S_i \cap S_j$
we add $(i,j)$ to $M$ with probability $\min(1, f(c(i), c(j), \Delta)\mu)$.
In particular, $M(i,j) = |S_i \cap S_j|$ for $f(c(i), c(j), \Delta)\mu \ge 1$.
Otherwise, since the value of $r$ is independently chosen for
each $t$, the distribution of $M(i,j)$ is binomial with $|S_i \cap S_j|$
trials and mean $|S_i \cap S_j| f(c(i), c(j), \Delta)\mu$. ■

Looking at Fig. 1 we notice that for the jaccard similarity
measure $s(i,j) = \frac{|S_i \cap S_j|}{|S_i| + |S_j| - |S_i \cap S_j|}$ the mean of the distribution is

$$\frac{|S_i \cap S_j|}{|S_i| + |S_j|} \cdot \frac{1 + \Delta}{\Delta}\, \mu = \frac{s(i,j)\,(1 + \Delta)}{(1 + s(i,j))\,\Delta}\, \mu \le 2\mu\, s(i,j)/\Delta ,$$

where the inequality uses $s(i,j), \Delta \in [0; 1]$. For all other
similarity measures the mean of the binomial distribution is
$\mu\, s(i,j)/\Delta$. As a consequence, for all these measures, pairs
with similarity below $(1 - \varepsilon)\Delta$ will be counted exactly, or
sampled with mean $(1 - \Omega(\varepsilon))\mu$. Also notice that for all the
measures we consider,

$$|S_i \cap S_j| \, f(|S_i|, |S_j|, \Delta) = O(s(i,j)/\Delta) .$$



We provide a running time analysis both in the standard
(RAM) model and in the I/O model of Aggarwal and Vitter
[24]. In the latter case we present an external memory
efficient implementation of the algorithm, IOBiSam. Let $b$
denote the average number of items in a transaction, i.e., there
are $mb$ items in total. Also, let $z$ denote the number of pairs
reported by the algorithm.

A. Running time in the standard model 

The first and last part of the algorithm clearly run in
expected time $O(mb + z)$. The time for reporting the result is
dominated by the time used for the main loop, but analyzing
the complexity of the main loop requires some thought. The
sorting of a transaction with $b_1$ items takes $O(b_1 \log b_1)$
time, and in particular the total cost of all sorting steps is
$O(mb \log n)$.¹

What remains is to account for the time spent in the while
loop. We assume that $|S_i \cap S_j| f(|S_i|, |S_j|, \Delta) = O(s(i,j)/\Delta)$,
which is true for all the measures we consider. The time
spent in the while loop is proportional to the number of pairs
sampled, and according to Lemma 1 the pair $(i,j)$ will be
sampled $|S_i \cap S_j| f(|S_i|, |S_j|, \Delta)\mu = O(\mu\, s(i,j)/\Delta)$ times
in expectation if $f(c(i), c(j), \Delta)\mu < 1$, and $|S_i \cap S_j|$ times
otherwise. In both cases, the expected number of samples is
$O(s(i,j)\,\mu/\Delta)$. Summing over all pairs we get the total time
complexity.

Theorem 2: Suppose we are given transactions $T_1, \dots, T_m$,
each a subset of $[n]$, with $mb$ items in total, and that $f$ is
the function corresponding to the similarity measure $s$. Also
assume that

$$|S_i \cap S_j| \, f(|S_i|, |S_j|, \Delta) = O(s(i,j)/\Delta) .$$

Then the expected time complexity of
BiSam($T_1, \dots, T_m$; $f$, $\mu$, $\Delta$) in the standard model is:

$$O\Big( mb \log n + \frac{\mu}{\Delta} \sum_{1 \le i < j \le n} s(i,j) \Big) . \qquad (1)$$

a) Discussion: This result is close to the best we could
hope for with no condition on the distribution of pairwise similarities.
The first term is near-linear in the input size, and the
output size $z$ may be as large as $\Omega(\Delta^{-1} \sum_{1 \le i < j \le n} s(i,j))$.
This happens if the average similarity among the pairs reported
is $O(\Delta)$, and the total similarity among other pairs is low and
does not dominate the sum. For such inputs, the algorithm
runs in $O(mb \log n + \mu z)$ time, and clearly $\Omega(mb + z)$ time
is needed by any algorithm.

¹ We remark that if $O(mb)$ internal memory is available, two applications of
radix sorting could be used to show a theoretically stronger result, by sorting
all transactions in $O(mb)$ time, following the same approach as the external
memory variant.

A comparison can be made with the complexity of schemes
counting the occurrences of all pairs. Such methods use time
$\Omega(mb^2)$, which is a factor $b/\log n$ larger than the first term.
In fact, the difference will be larger if the distribution of
transaction sizes is not even, and in particular the difference
in time will be at least a factor $b/\log b$ (but this requires a more
thorough analysis). Since usually one is interested in reporting
the highly similar pairs, the condition that $\Delta$ is greater than the
average similarity $\sum_{1 \le i < j \le n} s(i,j) / \binom{n}{2}$ is frequently true. (In
fact, one could imagine that $\Delta$ would in many cases be much
greater than the average similarity.) From the above we can
obtain the following simple (in some cases pessimistic) upper
bound on the time complexity:

Corollary 3: If $\Delta$ is not chosen smaller than the average
pairwise similarity, the expected time complexity of BiSam
is $O(mb \log n + \mu n^2)$.

This means that under the assumption of the corollary we
win a factor of at least $\min(b/\log b,\ \Omega(mb^2/(\mu n^2)))$ compared to the
exact counting approach. Note in this context that $\mu$ can be
chosen as a small value (e.g., $\mu = 15$ in our experiments).
In most of our experiments the first of the two terms (the
counting phase) dominated the time complexity. However, we
also found that for some data sets with mainly low-support
items, the second term dominated. If we let $a = mb/n$
denote the average support, the speedup can be expressed as
$\Omega(b \min(1/\log b,\ a/(\mu n)))$. That is, the second term dominates if
the average support is below roughly $\mu n/\log b$.

b) Independent items: As further evidence for (or explanation
of) why the time complexity of the second term may be
close to linear, we consider an input where each item $i$ appears
in a given transaction with probability $p_i$, independently of all
other items. Thus, the probability that distinct items $i$ and $j$
appear in a transaction is $p_i p_j$. We observe that each similarity
measure $s(i,j)$ in Fig. 1, with the exception of lift, satisfies
$s(i,j) \le \bar{s}(i,j)$, where $\bar{s}(i,j) = \frac{|S_i \cap S_j|}{|S_i|} + \frac{|S_i \cap S_j|}{|S_j|}$. Thus, we
get an upper bound on running time for these measures by
considering the similarity measure $\bar{s}(i,j)$. Observe that the
expected value of $\bar{s}(i,j)$ is $p_i + p_j$ by linearity of expectation.
Hence, the expected sum of similarities is:

$$\sum_{i=1}^{n} \sum_{j=i+1}^{n} (p_i + p_j) \le n \sum_{i=1}^{n} p_i + n \sum_{j=1}^{n} p_j = 2nb ,$$

where $b = \sum_{i} p_i$ is the expected number of items in a transaction.
This means that the running time of BiSam is indeed
$O(mb \log n + \mu n b/\Delta)$ for independent items.

B. Running time in the I/O model 

We now present IOBiSam, an I/O efficient implementation
of the BiSam algorithm. The rest of the paper can be read
independently of this section. As before, we assume that the
similarity measure is such that $|S_i \cap S_j| f(|S_i|, |S_j|, \Delta) = O(s(i,j)/\Delta)$.

In order to compute the support of each item, which means
computing the ItemCount function, a sorting of the dataset's
items is carried out. It is necessary to keep track of which
transaction each item belongs to. To compute the sorted list
of items, $O(\frac{N}{B} \log_{M/B} \frac{N}{B})$ I/Os are needed [24], where $N = mb$
is the number of pairs (item, transaction ID), $M$ is the
number of such pairs that fit in memory, and $B$ is the number
of pairs that fit in a memory page. When the items are sorted, it
is trivial to compute the number of occurrences of each item,
so it takes just $O(N/B)$ I/Os to compute and store the tuples
(item, support, transaction ID). In the following, let $C$ be the
set of such tuples written to disk.

We then sort the tuples according to transaction ID, and
secondarily according to support, again using $O(\frac{N}{B} \log_{M/B} \frac{N}{B})$
I/Os. This gives us each transaction in sorted order, according
to item supports. Assuming that each transaction fits in main
memory,² it is simple to determine which pairs satisfy the
inequality $f(c(T_t[i]), c(T_t[j]), \Delta)\mu > r$. When a pair satisfies
the inequality, it is buffered in an output page in memory,
together with the item supports. Once the page is filled, it
is flushed to external memory. The total cost of this phase
is $O(\frac{N + N'}{B})$ I/Os for the flushings and reads, where $N'$ is
the total number of pairs satisfying the inequality (i.e., the
number of samples taken). As before, the expected value of $N'$
is $O(\frac{\mu}{\Delta} \sum_{1 \le i < j \le n} s(i,j))$. Finally, we spend $O(\frac{N'}{B} \log_{M/B} \frac{N'}{B})$
I/Os to sort the sampled pairs (according to e.g. lexicographical
order). Then it is easy to compute $M(i,j)$, i.e., the number
of times each pair $(i,j)$ has been sampled by the algorithm,
using $O(N'/B)$ I/Os. The final step is to output all the pairs $(i,j)$
satisfying the condition:

$$M(i,j) \ge \mu/2 \quad \text{or} \quad M(i,j)\, f(c(i), c(j), \Delta) \ge 1 ,$$

which needs $O(N'/B)$ I/Os. We observe that this cost is dominated
by the cost of previous operations. The most expensive
steps are the sorting steps, whose total input has size
$O(N + N')$, implying that the following theorem holds:

Theorem 4: Suppose we are given transactions $T_1, \dots, T_m$,
each a subset of $[n]$, with $N = mb$ items in total, and $f$
is the function corresponding to the similarity measure $s$.
Also assume $|S_i \cap S_j| f(|S_i|, |S_j|, \Delta) = O(s(i,j)/\Delta)$. For
$N' = O(\frac{\mu}{\Delta} \sum_{1 \le i < j \le n} s(i,j))$ the expected complexity of
IOBiSam($T_1, \dots, T_m$; $f$, $\mu$, $\Delta$) in the I/O model is

$$O\Big( \frac{N + N'}{B} \log_{M/B} \frac{N + N'}{B} \Big) \ \text{I/Os.}$$
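
The sort-based steps of IOBiSam can be mimicked in memory to see how the data flows. In the sketch below Python's built-in sort stands in for an external-memory merge sort, so it shows the order of operations only and says nothing about I/O cost; the function names are illustrative assumptions.

```python
from itertools import groupby


def transactions_sorted_by_support(transactions):
    """Sort-and-count ItemCount followed by regrouping per transaction."""
    # Step 1: sort the N = mb pairs (item, transaction ID) by item.
    pairs = sorted((item, tid) for tid, t in enumerate(transactions) for item in set(t))
    # Step 2: one scan attaches the support, giving (item, support, tid) tuples.
    tuples = []
    for item, group in groupby(pairs, key=lambda p: p[0]):
        tids = [tid for _, tid in group]
        tuples.extend((item, len(tids), tid) for tid in tids)
    # Step 3: sort by transaction ID, secondarily by support (then item id).
    tuples.sort(key=lambda x: (x[2], x[1], x[0]))
    return [
        [(item, support) for item, support, _ in group]
        for _, group in groupby(tuples, key=lambda x: x[2])
    ]


def count_sampled_pairs(sampled_pairs):
    """Sort the N' sampled pairs and count each run, yielding M(i, j)."""
    return {pair: sum(1 for _ in run) for pair, run in groupby(sorted(sampled_pairs))}


example = [{1, 2}, {1, 2, 3}, {3, 4}]
print(transactions_sorted_by_support(example))
print(count_sampled_pairs([(1, 2), (4, 3), (1, 2)]))
```

Between the two functions sits the sampling loop itself, which scans each support-sorted transaction exactly as in the in-memory algorithm and appends the selected pairs to an output buffer.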



² The assumption is made only for simplicity of exposition, since the result
holds also without this assumption.

IV. Analysis of error probability

False negatives. We first bound the probability that a pair
$(i,j)$ with $s(i,j) \ge \Delta$ is not reported by the algorithm. This
happens if $M(i,j) < \mu/2$ and $M(i,j) f(c(i), c(j), \Delta) < 1$.
If $f(c(i), c(j), \Delta)\mu \ge 1$ then the pair is reported
with probability 1. Otherwise, since $M(i,j)$ has binomial
distribution, it follows from Chernoff bounds (see e.g. [25,
Theorem 4.2] with $\delta = 1/2$) that the probability of the former
event is at most $\exp(-\delta^2 \mu/2) = \exp(-\mu/8)$. Solving for
$\mu$ this means that we have error probability at most $\varepsilon$ if
$\mu \ge 8\ln(1/\varepsilon)$. This bound is pessimistic, especially when
$\varepsilon$ is not very small. Tighter bounds can be obtained using
the Poisson approximation to the binomial distribution, which
is known to be precise when the number of trials is not too
small (e.g., at least 100). Fig. 5 shows some values of $\mu$ and
corresponding false negative probabilities, using the Poisson
approximation.

Fig. 5. Values of $\mu$ and corresponding error probabilities $\varepsilon$.
The error probabilities $\varepsilon'$ are for the variant of the algorithm
where we return the whole multiset $M$, and use a different
method to filter false positives (see Section V-A).

False positives. The probability that a pair $(i,j)$ with
$s(i,j) < \Delta$ is reported depends on how far the mean
$|S_i \cap S_j| f(|S_i|, |S_j|, \Delta)\mu$ is from $\mu$. With the exception of
the jaccard measure, all measures we consider have mean
$\mu\, s(i,j)/\Delta$. In the following we assume this is the case (a
slightly more involved analysis can be made for the jaccard
measure). If the ratio $s(i,j)/\Delta$ is close to 1, there is a high
probability that the pair will be reported. However, this is
not so bad since $s(i,j)$ is close to the threshold $\Delta$. On the
other hand, when $s(i,j)/\Delta$ is close to zero we would like the
probability that $(i,j)$ is reported to be small. Again, we may
use the fact that either $f(c(i), c(j), \Delta)\mu \ge 1$ (in which case
the pair is exactly counted and reported with probability 0),
or $M(i,j)$ has binomial distribution with mean $\mu\, s(i,j)/\Delta$. For
$s(i,j) < \Delta/2$ we can use Chernoff bounds, or the Poisson
approximation, to bound the probability that $M(i,j) \ge \mu/2$.
Fig. 6 illustrates two Poisson distributions (one corresponding
to an item pair with measure three times below the threshold,
and one corresponding to an item pair with measure at the
threshold).
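
The error probabilities discussed in this section are easy to evaluate numerically. The sketch below compares the Chernoff-based requirement on μ with tail probabilities under the Poisson approximation. It assumes the reporting rule "at least μ/2 samples" and a sample count that is Poisson distributed with mean μ s(i,j)/Δ; the function names are illustrative.

```python
import math


def poisson_cdf(k, mean):
    """P[X <= k] for X ~ Poisson(mean), with k a non-negative integer."""
    term = math.exp(-mean)
    total = term
    for x in range(1, k + 1):
        term *= mean / x
        total += term
    return total


def chernoff_mu(eps):
    """Smallest mu allowed by the bound exp(-mu/8) <= eps."""
    return 8.0 * math.log(1.0 / eps)


def false_negative_prob(mu):
    """Pair exactly at the threshold (mean mu) gets fewer than mu/2 samples."""
    return poisson_cdf(math.ceil(mu / 2) - 1, mu)


def false_positive_prob(mu, ratio):
    """Pair with s(i,j)/Delta = ratio (mean mu*ratio) gets at least mu/2 samples."""
    return 1.0 - poisson_cdf(math.ceil(mu / 2) - 1, mu * ratio)


mu = 15
print(f"false negative at the threshold:  {false_negative_prob(mu):.4f}")
print(f"false positive at Delta/3:        {false_positive_prob(mu, 1 / 3):.4f}")
print(f"mu required by the Chernoff bound for eps = 0.018: {chernoff_mu(0.018):.1f}")
```

For μ = 15 the two Poisson tails can be compared with the percentages quoted for Fig. 6, and the last line shows how much larger a μ the Chernoff bound would demand for the same false negative probability.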

Actually, the number $\mu/2$ in the reporting loop of the
BiSam algorithm is just one possible choice in a range of 
possible trade-offs between the number of false positives and 
false negatives. As an alternative to increasing this threshold, a 
post-processing procedure may efficiently eliminate most false 
positives by more accurately estimating the corresponding 
values of the measure. 

V. Variants and extensions 

In this section we mention a number of ways in which our 
results can be extended. 

A. Alternative filtering of false positives 

The threshold of $\mu/2$ in the BiSam algorithm means that
we filter away most pairs whose similarity is far from $\Delta$. An
alternative is to spend more time on the pairs $(i,j) \in M$,
using a sampling method to obtain a more accurate estimate
of $|S_i \cap S_j|$. A suitable technique could be to use min-wise
independent hash functions [26], [27] to obtain a sketch of



[Figure: Poisson distributions with mean 5 and 15.]

Fig. 6. Illustration of false negatives and false positives for
$\mu = 15$. The leftmost peak shows the probability distribution
for the number of samples of a pair with $s(i,j) = \Delta/3$.
With a probability of around 13% the number of samples is
above the threshold (vertical line), which leads to the pair
being reported (false positive). The rightmost peak shows the
probability distribution for the number of samples of a pair
$(i,j)$ with $s(i,j) = \Delta$. The probability that this is below the
threshold, and hence not reported (false negative), is around
1.8%.



each set $S_i$. It suffices to compare two sketches in order to
have an approximation of the jaccard similarity of $S_i$ and
$S_j$, which in turn gives an approximation of $|S_i \cap S_j|$. Based
on this we may decide if a pair is likely to be interesting, or
if it is possible to filter it out. The sketches could be built
and maintained during the ItemCount procedure using, say,
a logarithmic number of hash functions. Indyk [27] presents
an efficient class of (almost) min-wise independent hash
functions.
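The sketch-based filter described above can be illustrated as follows. This is a minimal sketch under stated assumptions: salted BLAKE2b hashes stand in for a min-wise independent family, 256 hash functions are used purely to make the estimate stable in a small example, and all function names are hypothetical. The intersection size is recovered from the jaccard estimate $J$ via $|S_i \cap S_j| = \frac{J}{1+J}(|S_i| + |S_j|)$.

```python
import hashlib

def minhash_sketch(items, num_hashes=256):
    """Keep, for each salted hash function, the minimum hash value over the set."""
    sketch = []
    for k in range(num_hashes):
        salt = k.to_bytes(8, "little")  # a distinct salt gives a distinct hash function
        sketch.append(min(
            hashlib.blake2b(repr(x).encode(), digest_size=8, salt=salt).digest()
            for x in items
        ))
    return sketch

def estimate_jaccard(sketch_a, sketch_b):
    """Fraction of hash functions on which the two minima agree."""
    matches = sum(a == b for a, b in zip(sketch_a, sketch_b))
    return matches / len(sketch_a)

def estimate_intersection(jaccard, size_a, size_b):
    """Solve J = I / (size_a + size_b - I) for the intersection size I."""
    return jaccard / (1.0 + jaccard) * (size_a + size_b)

# Two sets of transaction identifiers with true jaccard similarity 300/900 = 1/3.
S_i = set(range(0, 600))
S_j = set(range(300, 900))
estimated_jaccard = estimate_jaccard(minhash_sketch(S_i), minhash_sketch(S_j))
estimated_intersection = estimate_intersection(estimated_jaccard, len(S_i), len(S_j))
print(estimated_jaccard, estimated_intersection)
```

Comparing two sketches costs time proportional to the number of hash functions, independent of the set sizes, which is what makes this attractive as a post-filter on the pairs in $M$.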

For some similarity measures, such as lift and overlap
coefficient, the similarity of two sets may be high even if the
sets have very different sizes. In such cases, it may be better to
sample the smaller set, say $S_i$, and use a hash table containing
the larger set $S_j$ to estimate the fraction $|S_i \cap S_j| / |S_i|$.
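A minimal sketch of this estimator, assuming uniform sampling without replacement from the smaller set and a Python `set` as the hash table; the function name and the `seed` parameter are illustrative only.

```python
import random

def estimate_containment(smaller, larger, sample_size, seed=0):
    """Estimate |smaller ∩ larger| / |smaller| from a random sample of `smaller`."""
    rng = random.Random(seed)
    lookup = set(larger)              # hash table containing the larger set
    population = sorted(smaller)      # fixed order so the sample is reproducible
    sample = rng.sample(population, min(sample_size, len(population)))
    hits = sum(1 for x in sample if x in lookup)
    return hits / len(sample)

S_i = set(range(1000))                # smaller set
S_j = set(range(500, 20000))          # larger set; half of S_i lies inside it
print(estimate_containment(S_i, S_j, sample_size=200))
```

The cost is one hash table lookup per sampled element, so it depends on the sample size rather than on the size of the larger set.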

B. Reducing space usage by using counting Bloom filters 

At the cost of an extra pass over the data, we may reduce
the space usage. The idea, previously found in e.g. [12], is
to initially create an approximate representation of $M$ using
counting Bloom filters (see [28] for an introduction). Then, in a
subsequent pass we may count only those pairs that, according
to the approximation, may occur at least $\mu/2$ times.
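The two-pass idea can be sketched as follows. This is an illustration under assumptions, not the implementation used in the paper: the stream of sampled pairs is a hypothetical input (how BiSam produces it is not shown, and a real second pass would have to reproduce the same sampling decisions), and the filter size and number of hash functions are arbitrary.

```python
import hashlib
from collections import Counter

class CountingBloomFilter:
    """Array of counters; a key's estimate is the minimum of its counters,
    which can overestimate but never underestimate the true count."""

    def __init__(self, num_counters=1 << 16, num_hashes=3):
        self.counters = [0] * num_counters
        self.num_hashes = num_hashes

    def _positions(self, pair):
        key = repr(tuple(sorted(pair))).encode()  # (i, j) and (j, i) are the same pair
        for k in range(self.num_hashes):
            digest = hashlib.blake2b(key, digest_size=8,
                                     salt=k.to_bytes(8, "little")).digest()
            yield int.from_bytes(digest, "little") % len(self.counters)

    def add(self, pair):
        for pos in self._positions(pair):
            self.counters[pos] += 1

    def estimate(self, pair):
        return min(self.counters[pos] for pos in self._positions(pair))

def two_pass_count(sampled_pairs, mu):
    """Pass 1 fills the filter; pass 2 counts exactly only the pairs
    whose approximate count reaches mu/2."""
    cbf = CountingBloomFilter()
    for pair in sampled_pairs:
        cbf.add(pair)
    exact = Counter()
    for pair in sampled_pairs:
        if cbf.estimate(pair) >= mu / 2:
            exact[tuple(sorted(pair))] += 1
    return exact

pairs = [(1, 2)] * 10 + [(2, 1)] + [(3, 4)] * 2
print(two_pass_count(pairs, mu=15))
```

Because the filter never underestimates, no pair that truly occurs at least $\mu/2$ times is lost; collisions can only cause some extra pairs to be counted exactly.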

C. Weighted items 

Some applications of the cosine measure, e.g. in information
retrieval, require the items to be weighted. BiSam easily
extends to this setting.

D. Adaptive variant

Instead of letting $\Delta$ be a user-defined variable, we may
(informally) let $\Delta$ go from $\infty$ towards 0. This can be achieved
by maintaining a priority queue of item pairs, where the
priority reflects the value of $\Delta$ that would allow the pair to
be sampled. Because $f$ is non-increasing in all parameters it
suffices to have a linear number of pairs from each transaction
in the priority queue at any time, namely the pairs that are
"next in line" to be sampled. For each of the similarity
measures in Fig. 1 the value of $\Delta$ for a pair is easily
computed by solving the equation $f(|S_i|,|S_j|,s)\,\mu = r$ for
$s$. Decreasing $\Delta$ corresponds to removing the pair with the
maximum value from the priority queue. At any time, the set
of sampled item pairs will correspond exactly to the choice
of $\Delta$ given by the last pair extracted from the priority queue.
The procedure can be stopped once sufficiently many results
have been found.

E. Composite measures 

Notice that if $f_1(|S_i|,|S_j|,\Delta)$ and $f_2(|S_i|,|S_j|,\Delta)$ are both
non-increasing, then any linear combination $\alpha f_1 + \beta f_2$, where
$\alpha, \beta > 0$, is also non-increasing. Similarly, $\min(\alpha f_1, \beta f_2)$ is
non-increasing. This allows us to use BiSam to directly search
for pairs with high similarity according to several measures
(corresponding to $f_1$ and $f_2$), e.g., pairs with cosine similarity
at least 0.7 and lift at least 2.
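As an illustration of the composite construction, the sketch below combines cosine and lift by taking a minimum. The concrete forms of `f_cosine` and `f_lift` are assumptions: they are derived here from the relation that the mean $|S_i \cap S_j|\, f\, \mu$ equals $\mu\, s(i,j)/\Delta$, together with the usual definitions of cosine, $|S_i \cap S_j|/\sqrt{|S_i||S_j|}$, and lift, $n|S_i \cap S_j|/(|S_i||S_j|)$, where $n$ is the number of transactions. Using a separate threshold per measure plays the role of the weights $\alpha$ and $\beta$.

```python
import math

def f_cosine(size_i, size_j, delta):
    """Assumed sampling function for cosine: 1 / (delta * sqrt(|S_i| |S_j|))."""
    return 1.0 / (delta * math.sqrt(size_i * size_j))

def f_lift(size_i, size_j, delta, num_transactions):
    """Assumed sampling function for lift: n / (delta * |S_i| |S_j|)."""
    return num_transactions / (delta * size_i * size_j)

def f_composite(size_i, size_j, delta_cosine, delta_lift, num_transactions):
    """Minimum of the two functions; it inherits the non-increasing property."""
    return min(f_cosine(size_i, size_j, delta_cosine),
               f_lift(size_i, size_j, delta_lift, num_transactions))

# Cosine at least 0.7 and lift at least 2, for items with supports 40 and 90
# in a data set of 1000 transactions (all numbers are made up).
print(f_composite(40, 90, delta_cosine=0.7, delta_lift=2.0, num_transactions=1000))
```

Since both functions decrease when a support or a threshold grows, so does their minimum, which is the property the sampling argument relies on.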

VI. Experiments 

To make experiments fully reproducible and independent 
of implementation details and machine architecture, we focus 
our attention on the number of hash table operations, and the 
number of items in the hash tables. That is, the time for BiSam 
is the number of items in the input set plus the number of 
pairs inserted in the multiset $M$. The space of BiSam is the
number of distinct items (for support counts) plus the number 
of distinct pairs in M. Similarly, the time for methods based 
on exact counting is the number of items in the input set plus 
the number of pairs in all transactions (since every pair is 
counted), and the space for exact counting is the number of 
distinct items plus the number of distinct pairs that occur in 
some transaction. 
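These cost measures are simple to compute from a data set. The sketch below follows the definitions in the preceding paragraph; the function names are illustrative, transactions are treated as sets (duplicate items within a transaction are ignored), and the multiset $M$ is passed in as a plain list of pairs because the sampling step itself is not reproduced here.

```python
from itertools import combinations

def exact_counting_cost(transactions):
    """Return (time, space) for exact counting: every pair in every
    transaction costs one hash table operation."""
    num_items = 0
    num_pairs = 0
    distinct_items = set()
    distinct_pairs = set()
    for transaction in transactions:
        items = sorted(set(transaction))
        num_items += len(items)
        distinct_items.update(items)
        for pair in combinations(items, 2):
            num_pairs += 1
            distinct_pairs.add(pair)
    return num_items + num_pairs, len(distinct_items) + len(distinct_pairs)

def bisam_cost(transactions, sampled_pairs):
    """Return (time, space) for BiSam given the multiset M of sampled pairs."""
    num_items = sum(len(set(t)) for t in transactions)
    distinct_items = set().union(*map(set, transactions))
    distinct_pairs = {tuple(sorted(p)) for p in sampled_pairs}
    return num_items + len(sampled_pairs), len(distinct_items) + len(distinct_pairs)

transactions = [[1, 2, 3], [2, 3], [1, 2, 3, 4]]
print(exact_counting_cost(transactions))                       # (19, 10)
print(bisam_cost(transactions, [(1, 2), (1, 2), (2, 3)]))      # (12, 6)
```

In the toy example the three transactions contain 9 items and 10 pairs in total, of which 4 items and 6 pairs are distinct, giving time 19 and space 10 for exact counting.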

We believe that these simplified measures of time and 
space are a good choice for two reasons. First, hash table 
lookups and updates require hundreds of clock cycles unless 
the relevant key is in cache. This means that a large fraction 
of the time spent by a well-tuned implementation is used for 
hash table lookups and updates. Second, we are comparing 
two approaches that have a similar behavior in that they count 
supports of items and pairs. The key difference thus lies in the 
number of hash table operations, and the space used for hash 
tables. Also, this means that essentially any speedup or space 
reduction applicable to one approach is applicable to the other 
(e.g. using counting Bloom filters to reduce space usage). 

A. Data sets 

Experiments have been run on both real datasets and 
artificial ones. We have used most of the datasets of the 
Frequent Itemset Mining Implementations (FIMI) Repository.¹
In addition, we have created three data sets based on the
Internet Movie Database (IMDB).

¹ http://fimi.cs.helsinki.fi/

Fig. 7. Key figures on the data sets used for experiments. The first 13 data sets are from the FIMI repository. The last
3 were extracted from the May 29, 2009 snapshot of the Internet Movie Database (IMDB). The datasets Chess, Connect,
Mushroom, Pumsb, and Pumsb_star were prepared by Roberto Bayardo from the UCI datasets and PUMSB. Kosarak
contains (anonymized) click-stream data of a Hungarian on-line news portal, provided by Ferenc Bodon. BMS-WebView-1,
BMS-WebView-2, and BMS-POS contain clickstream and purchase data of a legwear and legcare web retailer, see [29]
for details. Retail contains the (anonymized) retail market basket data from a Belgian retail store [30]. Accidents contains
(anonymized) traffic accident data [31]. The datasets T10I4D100K and T40I10D100K have been generated using an IBM
generator from the Almaden Quest research group. Actors contains the set of rated movies for each male actor who has acted
in at least 10 rated movies. DirectorActor contains, for each director who has directed at least 10 rated movies, the set of
actors from Actors that this director has worked with in rated movies. MovieActor is the inverse relation of Actors, listing
for each movie a set of actors.



Fig. 7 contains some key figures on the data sets.

B. Results and discussion 

Fig. 8 shows the results of our experiments for the cosine
measure. The time and space for BiSam are random
variables. The reported numbers are exact computations of
the expectations of these random variables. Separate experiments
have confirmed that the observed time and space are relatively well
concentrated around these values. The value of $\Delta$ used is also
shown; it was chosen manually in each case to give a
"human readable" output of around 1000 pairs. (For the IMDB
data sets and the Kosarak data set this was not possible; for
the latter this behaviour was due to a large number of false
positives.) Note that choosing a smaller $\Delta$ would bring the
performance of BiSam closer to the exact algorithms; this is
not surprising, since lowering $\Delta$ means reporting pairs with
a smaller similarity measure, which increases the number
of samples taken. As noted before, we are usually interested in
reporting pairs with high similarity, for almost any reasonable
scenario.

The results for the other measures are omitted for space
reasons, since they are very similar to the ones reported here.
This is because the complexity of BiSam is, in most cases,
dominated by the first phase (counting item frequencies),
meaning that fluctuations in the cost of the second phase have
little effect. This also suggests that we could increase the value
of $\mu$ (and possibly increase the value of the threshold $\mu/2$
used in the BiSam algorithm) without significantly changing
the time complexity of the algorithm.

We see that the speedup obtained in the experiments varies
between a factor 2 and a factor over 30. Figures 9(a) and 9(b)
give a graphical overview. The largest speedups tend to come 
for data sets with the largest average transaction size, or data 
sets where some transactions are very large (e.g. Kosarak). 
However, as our theoretical analysis suggests, large transaction 
size alone is not sufficient to ensure a large speedup — items 
also need to have support that is not too small. So while the 
DirectorActor data set has very large average transaction size, 
the speedup is only moderate because the support of items is 
low. In a nutshell, BiSam gives the largest speedups when 
there is a combination of relatively large transactions and 
relatively high average support. The space usage of BiSam 
ranges from being quite close to the space usage for exact 
counting, to a decent reduction. 

Though we have not experimented with methods based
on locality-sensitive hashing (LSH), we observe that our
method appears to have an advantage when the number $n$
of distinct items is large. This is because LSH in general
(and in particular for cosine similarity) requires comparison of
$\binom{n}{2}$ pairs of hash signatures. For the data sets Retail, BMS-
WebView-2, Actors, and MovieActor the ratio between the
number of signature comparisons and the number of hash table
operations required for BiSam is in the range 9-265. While
these numbers are not necessarily directly comparable, they do
indicate that BiSam has the potential to improve LSH-based
methods that require comparison of all signature pairs.

VII. Conclusion 

We have presented a new sampling-based method for finding 
associations in data. Besides our initial experiments, indicating 
that large speedups may be obtained, there appear to be many 
opportunities for using our approach to implement association 
mining systems with very high performance. Some such 
opportunities are outlined in Section V, but many nontrivial
aspects would have to be considered to do this in the best way.

Fig. 8. Result of experiments for the cosine measure and $\mu = 15$ (which gives false negative probability 1.8%). #output is the
number of pairs of items reported; $\Delta$ is the threshold for "interesting similarity." DirectorActor lacks the output because of
the huge number of pairs.




[Figure: two plots, with axis labels "Time for BiSam" (a) and "Space for BiSam" (b).]

(a) Comparison of the time for BiSam and for exact counting
in all experiments. The line is the identity function.
Typical difference is about an order of magnitude.

(b) Comparison of the space for BiSam and for exact
counting in all experiments. The line is the identity function.

Fig. 9. Space and time comparisons